Towards a Scalable and Robust Entity Resolution - Approximate Blocking with Semantic Constraints

Authors

  • Mingyuan Cui
  • Qing Wang
  • Huizhi Liang
Abstract

Entity resolution, or record linkage, is the process of identifying data records, across one or more datasets, that refer to the same real-world entity. To deal with large datasets, many real-life applications require scalable, high-quality entity resolution techniques. Blocking techniques can help scale up the entity resolution process. Locality-sensitive hashing (LSH) is an approximate blocking approach that hashes objects within a certain distance of each other into the same block with high probability. This technique filters out records with low similarity and thus decreases the number of comparisons. However, the traditional approach considers only the textual or string similarity of records and ignores their semantic similarity or constraints. This project proposes and implements a framework that incorporates semantic constraints into the approximate blocking process to achieve scalable, high-performance entity resolution. First, minhash-based locality-sensitive hashing methods are applied to generate minhash signatures based on the textual similarity of records. Then, for the semantic constraints, the whole domain knowledge of a dataset is extracted into a domain tree, and constraint functions are applied according to a set of predefined rules to generate a set of semantic signatures. These two sets of signatures are then combined to group the records into blocks. Experiments are conducted on the Cora dataset. The results show that the framework makes blocking much more accurate while maintaining high completeness.
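The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the record layout (`text` and `type` fields), the child-to-parent dictionary standing in for the domain tree, and the single rule "use the top-level ancestor as the semantic signature" are all assumptions made for illustration. The sketch computes a minhash signature from character 3-grams, derives a semantic signature from the toy domain tree, and uses both as part of the block key, so that records are grouped only when they agree textually on some LSH band and share the same semantic signature.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of a normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=8):
    """Minhash signature: the minimum hash value of the shingle set per seed."""
    grams = shingles(text)
    return tuple(min(seeded_hash(seed, g) for g in grams)
                 for seed in range(num_hashes))


def semantic_signature(record, domain_tree):
    """Hypothetical rule: walk a child -> parent map up to the root concept."""
    node = record["type"]
    while domain_tree.get(node) is not None:
        node = domain_tree[node]
    return node


def block_records(records, domain_tree, num_hashes=8, band_size=2):
    """Group record ids by (semantic signature, band index, band values)."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        sig = minhash_signature(rec["text"], num_hashes)
        sem = semantic_signature(rec, domain_tree)
        for start in range(0, num_hashes, band_size):
            blocks[(sem, start, sig[start:start + band_size])].add(rid)
    # Blocks with a single record produce no candidate pairs, so drop them.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


if __name__ == "__main__":
    # Toy domain tree (assumed): both paper types roll up to "publication".
    tree = {"journal_article": "publication",
            "conference_paper": "publication",
            "publication": None,
            "patent": None}
    data = {
        "r1": {"text": "Approximate blocking for entity resolution",
               "type": "journal_article"},
        "r2": {"text": "Approximate blocking for entity resolution",
               "type": "conference_paper"},
        "r3": {"text": "Approximate blocking for entity resolution",
               "type": "patent"},
    }
    for key, ids in block_records(data, tree).items():
        print(key[0], key[1], sorted(ids))
```

In this sketch the semantic signature acts as a hard filter on top of textual LSH: `r3` has the same text as `r1` and `r2` but never shares a block with them, because its semantic signature differs. How the original framework actually combines the two signature sets is not specified in the abstract.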

Similar resources

Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Semantic Constraint and QoS-Aware Large-Scale Web Service Composition

Service-oriented architecture facilitates the running time of interactions by using business integration on the networks. Currently, web services are considered as the best option to provide Internet services. Due to an increasing number of Web users and the complexity of users’ queries, simple and atomic services are not able to meet the needs of users; and to provide complex services, it requ...

Adaptive Candidate Generation for Scalable Edge-discovery Tasks on Data Graphs

Several ‘edge-discovery’ applications over graph-based data models are known to have worst-case quadratic complexity, even if the discovered edges are sparse. One example is the generic link discovery problem between two graphs, which has invited research interest in several communities. Specific versions of this problem include link prediction in social networks, ontology alignment between met...

Intelligent scalable image watermarking robust against progressive DWT-based compression using genetic algorithms

Image watermarking refers to the process of embedding an authentication message, called watermark, into the host image to uniquely identify the ownership. In this paper a novel, intelligent, scalable, robust wavelet-based watermarking approach is proposed. The proposed approach employs a genetic algorithm to find nearly optimal positions to insert watermark. The embedding positions coded as chr...

Publication date: 2014